Abstract

Essay testing has traditionally served as a means by which students'
understanding of structural interrelationships between concepts could be
evaluated and promoted. However, essay test scores frequently suffer from poor
reliability and validity, and the time required to evaluate essay test
responses may be prohibitive in large classes. This paper discusses an
alternative procedure for measuring and promoting structural understanding
which overcomes many of the problems encountered when using essay tests. In
Experiment I, numerical judgments of the strength of relationships existing
between concepts from three units of undergraduate general psychology were
obtained from between 81 and 103 students. Both the reliability with which
these judgments were made and the similarity between students' judgments and
those of the instructor were found to be significantly correlated with
students' essay test scores over the same three units. Previous to this
research, judgment reliability has served only as a criterion by which
relationship judgment data were judged acceptable or unacceptable for further
analysis through multidimensional scaling (MDS). Importantly, MDS analysis of
relationship judgments was not necessary in order for these judgments to serve
as an effective means of evaluating structural understanding in the present
research. Consequently, only those relationship judgments on which experts
show strong agreement need be considered. Multidimensional scaling analysis,
in contrast, requires that all judgments, even those which may be unstable, be
obtained and utilized in evaluations of cognitive structures. Experiment II
demonstrated that restricting relationship judgments to these high-reliability
concept pairs increased the accuracy of predictions of essay scores from
predictor measures derived from relationship judgments. The use of
relationship judgments may provide an alternative to essay testing in
situations in which essay testing is impractical.

Relationship Judgments and Multidimensional Scaling in the
Measurement of Structural Understanding

The widespread use of essay examinations is based on the premise that
essay examinations are capable of testing different kinds of knowledge than
can be easily tested through multiple-choice or true-false examinations.
Indeed, there is support for this proposition. It is well established that
the free recall of information demanded by essay examinations requires
considerably greater integrative rehearsal (Lindsay & Norman, 1977) or
depth-of-processing (Craik & Tulving, 1975; Craik & Watkins, 1973; Woodward,
Bjork, & Jongeward, 1973) during learning than is needed in order for
recognition to occur. Multiple-choice and true-false examinations usually test
only recognition. Since integrative rehearsal or deep processing involves
processing the meaning of information, it follows that essay examinations more
clearly test students' comprehension than do recognition tests such as
multiple-choice and true-false. Bloom (1956) and others (e.g., Ayers, 1966;
Billeh, 1974; Gall, 1970; Roberts, 1976; and Scriven, 1967) have also argued
that essay testing can more easily be used in evaluating "higher levels" of
understanding (specifically, the structural interrelationships and
implications of a domain) than can multiple-choice or true-false tests. Yet
another reason for the use of essay examinations is the observation that
students who anticipate examinations which focus on higher levels of
understanding seem to study and learn in a qualitatively different fashion
than do students who expect multiple-choice or true-false examinations. That
is, a teacher may guide learning through a careful manipulation of students'
expectancies concerning examinations. Doak (1970), Ladd & Anderson (1970), and
Willson (1973) have all shown that the level of a teacher's questions and
discourse influences students' level of discourse and performance on
examinations designed to assess several levels of understanding.

While essay examinations are helpful in evaluating and promoting higher 
levels of understanding, the utility of this testing method is limited.
Chase (1968), Linn, Klein, & Hart (1972), Marshall (1967, 1972), Marshall &
Powers (1969), and Scannel (1966) have pointed to a variety of extraneous
characteristics of essay responses that influence grades,
including neatness, spelling errors, and grammatical errors. In addition,
the quality of previously graded essay responses can influence the scoring
of subsequent responses through a contrast effect. Positive and negative halo
effects established through a scorer's prejudices also may influence essay
grades. Clearly, essay exam scores are influenced by many factors other than 
the quality of information contained in essay responses, and thus, may suffer 
from poor reliability and validity. While these difficulties may be overcome 
to some extent through careful control of essay scoring procedures, it is not 
so simple a matter to eliminate another disadvantage to the use of essay 
examinations: the time required to adequately score essay responses from large 
classes. 

Alternative methods of assessing higher-level, structural understanding 
have been investigated, presumably in the hope of finding a less time-consuming,
more objective testing method. Word-association and graph-construction methods
are among these alternatives (Johnson, 1967, 1969; Preece, 1976; Shavelson,
1972, 1973, 1974), but both are more time consuming to score than are essay tests.

Another approach to evaluating understanding of the structural
interrelationships between concepts in a domain has been multidimensional
scaling (MDS) of students' similarity or relationship judgments. Here, students
use a numerical scale in judging the strength of relatedness or similarity
within all possible
pairs of a selected set of concepts. These judgments are considered to be 
equivalent to distances between the concepts and are used in constructing a 
graphic array of points in space (sometimes called a "cognitive map") that 
best captures the pattern of relationship judgments considered as a whole. 
Cognitive maps formed through MDS reflect some important aspects of a student's 
structural understanding of the domain from which the concepts were sampled. 
For example, Johnson, Cox, & Curran (1970) obtained similarity ratings for 
all possible pairs of six concepts from mechanical physics. The similarity 
judgments were averaged across students and analyzed through MDS. The
MDS-produced map was found to be very similar to arrays constructed through
logical examination of the mathematical relationships that exist between the 
six concepts. Thus, MDS analysis of similarity judgment data was shown to 
capture mathematical relationships that exist between concepts in mechanical 
physics. 
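
As an illustration of how a matrix of pairwise judgments becomes a spatial
map, the following is a minimal sketch of classical (Torgerson) MDS in Python.
It is not the scaling program used in the studies cited here, which is not
specified; the function name classical_mds, the power-iteration eigen-solver,
and the example distances are assumptions made for illustration. The sketch
also assumes that the judgments have already been converted to distances (for
instance by subtracting each relatedness rating from the scale maximum, itself
an assumption) and that those distances are close to Euclidean.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500):
    """Return coordinates whose pairwise distances approximate `dist`."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        lam = 0.0
        for _ in range(iters):  # power iteration for the leading eigenpair
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            lam = math.sqrt(sum(x * x for x in w))
            if lam < 1e-12:
                break
            v = [x / lam for x in w]
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(lam)
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical check: four points at the corners of a 4 x 3 rectangle.
corners = [(0, 0), (4, 0), (4, 3), (0, 3)]
dist = [[math.dist(p, q) for q in corners] for p in corners]
layout = classical_mds(dist)
```

For exactly planar input such as the rectangle above, the recovered layout
reproduces the original distances up to rotation and reflection. Distances
derived from ratings are rarely exactly Euclidean, so an established scaling
routine would be preferable to this sketch for real judgment data.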

Wainer & Kaye (1974) also validated MDS-produced cognitive maps as a
measure of structural understanding. Sixteen concepts from developmental
psychology were scaled by 45 undergraduates before and after receiving instruction
in that topic area. Averaged similarity judgments were analyzed through MDS
and the cognitive map obtained was compared to a map based on the instructor's
similarity judgments. The student map obtained following instruction was found
to be significantly more similar to the instructor's map than was the map formed
prior to instruction.

Fenker (1975) also reported a study in which MDS-produced cognitive maps 
were evaluated as a measure of structural understanding. At the beginning of 
an undergraduate course in statistics and research design, 27 students were
given a list of 21 concepts to which they were told they should give special
attention. Following completion of the appropriate course units, students
rated the relatedness of these 21 concepts. Several of the concept pairs 
were repeated and judgments obtained from these repeated pairs were correlated 
with initial judgments for those pairs, enabling assessment of each student's 
reliability in making the relationship judgments. Ten of the students showed 
reliability coefficients of less than .50 and their data were given no further
consideration. Relationship judgments from the remaining 17 students were
analyzed through MDS and these student maps were compared to a map based on
averaged relationship judgments obtained from a panel of eight experts
(faculty members and graduate students). A significant correlation was found
between students' course grades and an index of overall similarity between
the maps of students and the expert map. There was also a high degree of 
similarity between maps of the individual experts. 

In sum, efforts aimed at developing MDS of relationship judgments as a 
method of evaluating and promoting structural understanding have been promising. 
Additional work, however, is needed. Relationship judgments carry all of the
information which subsequently appears in MDS-produced cognitive maps, yet 
characteristics of these original, unanalyzed relationship judgments have not 
been examined as measures of structural understanding. In addition, previous 
work has focused exclusively on the relationship between students' test scores 
and the similarity between their cognitive maps and those of the instructor. 
Other characteristics of students' relationship judgments, such as judgment 
reliability, may also reflect structural understanding, but have not been 
examined. The purpose of the first experiment was to examine the relationship 
between characteristics of students' original, unanalyzed relationship judgments 
and their performance on essay examinations covering the content of three 
units of an undergraduate general psychology course. The second experiment
sought to determine whether or not the strength of this relationship could be
enhanced by having students judge only concept pairs on which several experts 
showed high levels of agreement. 

EXPERIMENT I 
Method 

Subjects 

Students enrolled in five sections of an undergraduate general psychology 
course participated in Experiment I. All sections were taught by the same 
instructor using similar lecture notes and identical testing materials. The 
five sections were taught over a period of three semesters, but data from the 
five sections have been combined. Data from three examinations were examined 
in this study. The number of students who completed each of the three 
examinations was 103, 81, and 88 for exams 1, 2, and 3, respectively. The 
number of students varied as a consequence of attrition and absences.

Procedure

Students were tested on three separate occasions over information
presented in three units of the general psychology course (biopsychology,
developmental psychology, and social psychology). Each of the three
examinations consisted of four parts: (a) a 40-minute, 50-item multiple-choice
test which focused on definitions of single concepts; (b) a 15-minute
relationship judgment test in which students were instructed to assign a number
between 1 and 9 to each of 45 pairs of concepts to indicate how strong a
relationship they thought existed between the concepts in that pair (a rating
of 9 indicated that the two concepts were viewed as very highly related, while
a rating of 1 indicated that the two concepts were felt to be unrelated);
(c) a second set of nine concept pairs sampled from the larger set which
students were instructed to rate again; and (d) a 20-minute essay test which
called for students to discuss the nature of the relationships within each of
three pairs of concepts selected by the instructor.

Each section of each examination was collected prior to administering the 
subsequent section. Students were told that their course grades would be based 
75% on multiple-choice performance and 25% on essay performance, and that
"borderline grades" would be decided through an examination of the
"reasonableness" of their relationship judgments. Since the purpose of this
study was to examine the relationship between students'
strength-of-relationship judgments
and their performance on essay examinations of structural understanding, only
relationship judgments and essay exams will be given further consideration here.

Scoring and Reliability of Essay Tests

Essay tests were scored by the course instructor using guidelines designed 
to improve scoring reliability. These guidelines indicated the sorts of 
information that should be included in each response and how many points would 
be given for each such piece of information. Likely errors were listed and 
penalties associated with these errors were noted. Each essay item was scored 
from 0 to 5 points, with 0 assigned in the case of no response, and 5 assigned 
to a perfect response. The average of each student's item scores served as 
that student's essay test score. An attempt was made to maintain anonymity in 
the scoring procedure and there were very few unavoidable exceptions to this 
rule. Tests were scored item-by-item, rather than student-by-student, so as
to reduce the influence of halo effects operating within tests. Tests and
test items were scored in a different order on a second occasion one week
following the first scoring. Correlations were computed between item scores
obtained on these two occasions. All items yielded scoring reliability
coefficients of .80 or greater. Total scores showed standard error values of
.57, .71, and .42 points for exams 1, 2, and 3, respectively. These values
represent the average absolute differences between essay test scores assigned
on the two scoring occasions.
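
Because the "standard error" reported here is an average absolute difference
rather than a conventional standard error, a short sketch may make the
computation concrete. The function names and the three students' item scores
below are hypothetical and are not data from the study.

```python
def essay_score(item_scores):
    """A student's essay test score: the mean of the 0-5 item scores."""
    return sum(item_scores) / len(item_scores)

def mean_absolute_difference(first_scoring, second_scoring):
    """Average absolute difference between two scorings of the same students."""
    pairs = list(zip(first_scoring, second_scoring))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical item scores for three students on the two scoring occasions.
occasion_1 = [essay_score(s) for s in ([5, 3, 4], [2, 2, 1], [4, 4, 5])]
occasion_2 = [essay_score(s) for s in ([4, 3, 4], [2, 3, 1], [4, 4, 5])]
rescoring_error = mean_absolute_difference(occasion_1, occasion_2)
```

The result is expressed in the same 0 to 5 point units as the essay scores
themselves, which is what allows it to be compared with the prediction errors
reported later.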

Measures Obtained from Relationship Judgments 

Two measures were obtained from each student's relationship judgments. 
First, judgments obtained for the nine duplicated concept pairs were correlated
with the original judgments for these concept pairs. These correlations served
as an index of test-retest judgment reliability for each student. While this
reliability coefficient is usually used in eliminating unreliable data (e.g.,
Fenker, 1975), it was examined in the present study as a potential measure of
students' structural understanding. The logic was that students who do not
understand the interrelationships between concepts in an area will base their
relationship judgments on guesswork, and thus will show less stability in
judgments obtained on repeated occasions. The second measure extracted from
each student's relationship judgments was a correlation between the student's
judgments and those of the instructor. This correlation served as an index of
similarity between relationship judgments of students and those of the instructor.
Previous research has focused on the similarity between students' MDS-produced
cognitive maps and the maps of their instructors. The correlation between
student and instructor judgments was examined in the present study as a simpler
and quicker alternative to MDS analysis followed by complex comparisons between
students' solutions and that of the instructor.
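
The two measures can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the analysis program used in the
study: the function names, the dictionary layout (concept-pair identifier
mapped to a 1-9 rating), and the ratings themselves are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either list is constant

def judgment_measures(first, repeat, instructor):
    """Return (reliab, corr) for one student.

    first      -- the student's ratings for every concept pair
    repeat     -- the student's second ratings for the re-rated pairs only
    instructor -- the instructor's ratings for every concept pair
    """
    rerated = sorted(repeat)
    reliab = pearson([first[p] for p in rerated], [repeat[p] for p in rerated])
    pairs = sorted(first)
    corr = pearson([first[p] for p in pairs], [instructor[p] for p in pairs])
    return reliab, corr

# Hypothetical ratings for five concept pairs, three of them rated twice.
student = {1: 9, 2: 2, 3: 5, 4: 7, 5: 3}
student_again = {1: 8, 3: 5, 5: 2}
teacher = {1: 8, 2: 1, 3: 6, 4: 9, 5: 2}
reliab, corr = judgment_measures(student, student_again, teacher)
```

Neither measure requires a complete matrix of judgments: "reliab" uses only
the re-rated pairs, and "corr" uses whatever pairs both the student and the
instructor rated.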

Relationship Judgments as Measures of Structural Understanding

In evaluating students' strength-of-relationship judgments as indicators
of structural understanding, the two variables described above (to be referred
to subsequently as "reliab" and "corr") were entered as predictor variables
into stepwise multiple regression analyses in predicting students' essay test
scores. Three separate analyses were completed, one for each of the three
essay tests.

Results 



The results of these stepwise multiple regression analyses are summarized
in Tables 1 and 2. Scores on all three essay tests were predicted to a
statistically significant extent by linear combinations of "reliab" and "corr."
In two out of three cases (tests 2 and 3), "reliab" was a stronger predictor of
essay test scores than was "corr."

Insert Tables 1 and 2 about here

In evaluating the accuracy of the predictions made through combinations of
"reliab" and "corr," essay test scores predicted for each subject were compared
with students' obtained essay test scores. Correlations between these two sets
of scores (multiple correlations) were .49 (F(2, 100) = 15.61, p < .001),
.44 (F(2, 78) = 9.20, p < .001), and .46 (F(2, 85) = 11.15, p < .001) for tests
1, 2, and 3, respectively. The average absolute differences between predicted
and obtained essay scores (the standard errors of estimate) were .85, .96, and
.83 points on tests 1, 2, and 3. It may be recalled that essay test scores
could vary from 0 to 5 points.
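
With only two predictors, the multiple correlation and its F ratio can be
recovered from the three pairwise correlations using standard textbook
identities. The sketch below applies them to the rounded correlations printed
in Table 1; it is a consistency check written for illustration, not the
regression program used in the study, and the function names are assumptions.

```python
import math

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors,
    computed from the three pairwise correlations."""
    r_sq = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_sq)

def f_ratio(r, n, k=2):
    """F statistic for a multiple correlation r with k predictors and n cases."""
    return (r ** 2 / k) / ((1 - r ** 2) / (n - k - 1))

# Values as printed in Table 1: (essay-reliab, essay-corr, reliab-corr, N).
table_1 = {1: (.16, .49, .36, 103), 2: (.39, .31, .30, 81), 3: (.44, .21, .22, 88)}

for exam, (r_er, r_ec, r_rc, n) in table_1.items():
    big_r = multiple_r(r_er, r_ec, r_rc)
    # A forward stepwise procedure enters the stronger single predictor first.
    entered_first = "reliab" if abs(r_er) > abs(r_ec) else "corr"
    print(exam, entered_first, round(big_r, 2), round(f_ratio(big_r, n), 2))
```

The multiple correlations obtained this way round to .49, .44, and .46, and
the predictor entered first is "corr" for exam 1 and "reliab" for exams 2 and
3, in agreement with Table 2. The F ratios come out close to, but not exactly
at, the reported values because the input correlations are rounded to two
decimals.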

In considering the sizes of these errors in prediction, they may be 
compared to differences found between essay test scores obtained on the two 
scoring occasions (.57, .71, and .42 points for tests 1, 2, and 3, respectively).
While measures derived from relationship judgments do not perfectly predict
essay test scores, essay test scores assigned on one occasion also fail to
perfectly predict essay test scores assigned one week later.

It is possible to identify several factors which can account for some of
the error made in predicting essay scores from "reliab" and "corr." First,
the lack of perfect reliability (i.e., unpredictability) in essay test scores
limits the degree to which those scores can be predicted. Second, while essay
test items required students to discuss the nature of the relationships between
three concept pairs, the relationship judgment tests required students to
consider relationships existing in 45 pairs of concepts. Therefore, essay test
scores were based on a different, considerably narrower domain than were
measures derived from relationship judgments. Since the two types of tests
covered different material, the correlation between essay test scores and
measures based on relationship judgments cannot be expected to be perfect.
Third, because students were told that relationship judgments would only be
used in deciding borderline grades, there was little incentive for them to
expend much effort in completing the judgments. This probably introduced
some error into relationship judgments, which in turn weakened the correlation
between relationship judgments and essay test scores. Finally, relationship
judgments were obtained for all possible pairs of concepts in each unit.
While a complete set of judgments such as this is necessary for MDS analysis,
the simpler correlational approach followed in this research does not require
that all pairs of concepts be judged. It may be assumed that in Experiment I
some of the concept pairs judged for relationship were judged less reliably,
even by the instructor, than were other pairs. Thus, students' judgments were
compared for similarity to a set of instructor-generated judgments which were
somewhat unstable and thus provided a less than ideal standard for comparison.

EXPERIMENT II 

The purpose of the second experiment was to further examine the utility 
of testing structural understanding through relationship judgments. Several 
of the situational characteristics of Experiment I which may have weakened 
the relationship between measures derived from relationship judgments and
essay test scores were eliminated in Experiment II: (a) performance on
relationship judgments contributed 15% towards each exam grade (multiple choice
contributed 60% and essay scores contributed 25%); and (b) relationship
judgments were obtained only for concept pairs on which the top 10 students of
previous semesters (based on overall course performance) showed relatively good
agreement (specifically, only concept pairs with standard deviations of 2.0 or
less were used in Experiment II). The median of their judgments served as
the "expert" judgment pattern against which the judgments of students
participating in Experiment II were compared for similarity. (It should be
noted that a very high degree of similarity was observed between these median
expert judgments and the judgments of the instructor.) This change in the
selection of concept pairs reduced the number of pairs to be judged from 45 per
exam in Experiment I to 15, 20, and 14 in Experiment II on exams 1, 2, and 3,
respectively. Because the number of pairs was so reduced, it was possible to
administer the entire set of judgments on two occasions in evaluating judgment
reliability. It will be recalled that in Experiment I, only nine concept pairs
were repeated in evaluating judgment reliability.
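
The pair-selection rule described above can be sketched as follows. The
function name, the data layout, and the example ratings are hypothetical, and
the use of the sample standard deviation (rather than the population form) is
an assumption, since the text does not say which was used.

```python
from statistics import median, stdev

def select_expert_pairs(panel_ratings, max_sd=2.0):
    """Keep only concept pairs on which the panel agrees, and return the
    panel median for each retained pair as the "expert" judgment.

    panel_ratings -- dict mapping a concept pair to the list of 1-9 ratings
                     given by the panel members
    """
    return {pair: median(ratings)
            for pair, ratings in panel_ratings.items()
            if stdev(ratings) <= max_sd}

# Hypothetical panel of four raters and two concept pairs.
panel = {
    ("neuron", "synapse"): [8, 9, 8, 9],   # close agreement: retained
    ("neuron", "attitude"): [1, 9, 1, 9],  # wide disagreement: dropped
}
expert_pattern = select_expert_pairs(panel)
```

Each student's "corr" would then be computed against the retained medians
only, so unstable pairs never enter the standard of comparison.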

Subjects

Between 36 and 38 students enrolled in an undergraduate general psychology
course participated in Experiment II. The course was taught by the same
instructor involved in Experiment I using the same lecture materials and text,
and testing procedures were the same as described in Experiment I, except as
noted above. The number of students who completed each of the three examinations
was 38, 38, and 36 for exams 1, 2, and 3, respectively. The number of students
completing the exams varied as a consequence of attrition.

Procedure

Procedures followed during Experiment II were essentially the same as
those of Experiment I, with the modifications associated with relationship
judgments that have been noted previously. Measures obtained in Experiment II
were identical to those of Experiment I, except that judgment reliability
("reliab") was based on completion of the full set of judgments twice and the
similarity between each student's judgments and those of the panel of top
students ("corr") was determined by computing a correlation between each
student's averaged judgments and the median judgments obtained from the 10 top
students.

Stepwise multiple regression analyses were again used in evaluating the 
ability of "reliab" and "corr" to predict students' essay scores. 

Results 

The results of these stepwise multiple regression analyses are summarized
in Tables 3 and 4. Essay scores from exams 2 and 3 were significantly predicted
by linear combinations of "reliab" and "corr," although essay scores from exam
1 were not. The multiple correlations between "reliab" and "corr" and essay
scores were .28 (F(2, 35) = 1.44, n.s.), .61 (F(2, 35) = 10.17, p < .005),
and .76 (F(2, 33) = 22.32, p < .001) for exams 1, 2, and 3, respectively.
These multiple correlations may be compared to the values obtained in
Experiment I: .49, .44, and .46 for exams 1, 2, and 3.
The average absolute differences between obtained and predicted essay scores
(the standard errors of estimate) were .87, .85, and .70 for exams 1, 2, and 3,
compared to the standard error values from Experiment I: .85, .96, and .83
points. It should again be remembered in interpreting the sizes of these
errors that essay scores could vary from 0 to 5 points and that the essay
scores were themselves not totally reliable.

Insert Tables 3 and 4 about here

In comparing the predictive accuracy of Experiment II to that found in 
Experiment I, a general tendency towards greater predictability of essay
scores in Experiment II is seen. In Experiment I, the average percentage of
essay score variance predicted across the three exams was 22% compared to 34%
in Experiment II. The average standard error of the estimate across the
three exams of Experiment I was .88 points, while the average standard error 
of the estimate for Experiment II was .81 points. 
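
These averages follow directly from the figures reported above. The short
check below assumes that "percentage of variance predicted" means the squared
multiple correlation averaged over the three exams; that reading is an
assumption, although it reproduces the stated values.

```python
# Multiple correlations and standard errors of estimate as reported in the text.
multiple_rs = {"Experiment I": [.49, .44, .46], "Experiment II": [.28, .61, .76]}
std_errors = {"Experiment I": [.85, .96, .83], "Experiment II": [.87, .85, .70]}

def mean(values):
    return sum(values) / len(values)

def percent_variance(rs):
    """Average squared multiple correlation, expressed as a percentage."""
    return 100 * mean([r * r for r in rs])

for name in multiple_rs:
    print(name,
          round(percent_variance(multiple_rs[name])),
          round(mean(std_errors[name]), 2))
```

Under that assumption the averages round to 22% and 34% of variance, and to
.88 and .81 points of error, for Experiments I and II respectively.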

In further comparing results from the two experiments, the predictive 
contribution of "corr" increased from Experiment I to Experiment II, perhaps
because the "expert" judgments were more stable, being based on only highly 
reliable concept pairs. The predictive contribution from "reliab" remained 
approximately the same as in Experiment I. 

Discussion 

Measuring and promoting students' understanding of the structural
interrelationships between concepts calls for the use of testing procedures
which focus on these interrelationships. Essay testing has traditionally been
used for this purpose, but suffers from several weaknesses, not the least of
which is the great time required to reliably score essay exams administered in
large classes. Testing knowledge of interrelationships by having students
rate the perceived strength-of-relationship between concepts presented in pairs
was examined in the present study as an alternative to essay testing for use in
those situations in which essay testing is impractical. 




In contrast to previous research in this area, the present study examined 
students' original relationship judgments, rather than MDS-analyzed judgments;
both the similarity between students' and the instructor's judgments and the
reliability of students' judgments were examined as predictors of structural
understanding as measured by essay examinations; and the study used data
collected from a large number of students replicated over three separate
examinations.

It was found from Experiment I that the reliability of students'
relationship judgments was significantly related to essay test scores. This
finding is important because judgment reliability has previously been used as
a criterion for accepting or discarding relationship judgment data (e.g., Fenker,
1975), whereas it was found in Experiment I to be an important indicator of
structural understanding. The present study also showed that students'
relationship judgments do not need to be analyzed through multidimensional
scaling prior to comparison to expert judgments in order for them to be useful
in evaluating structural understanding. Pearson product-moment correlations
between each student's judgments and those of the instructor or panel of
"experts" provided statistically significant prediction of students' essay
scores. The importance of this finding is twofold. First, previous research
which has attempted to measure structural understanding through relationship
judgments has always involved MDS analysis of relationship judgments followed
by comparisons of MDS solutions for similarity. These procedures are far more
difficult and time consuming than the simple correlational approach taken in
the present research. Second, MDS analysis of relationship judgments requires
a complete matrix of judgments, i.e., all concept pairs must be judged. Even
experts may have difficulty generating reliable judgments for some of these
pairs, and the "ideal" MDS solutions produced from their judgments will
necessarily reflect this poor reliability, thus providing a standard of
excellence which is less than perfect. In contrast, if students' judgments are
compared correlationally to those of the instructor or expert, it is not
necessary to obtain judgments on all possible concept pairs. Only concept pairs
which experts can reliably judge for relatedness need be considered, thus
providing a superior standard against which student judgments are compared for
similarity. Experiment II, which utilized only these high-reliability concept
pairs, demonstrated an increase in the predictive importance of similarity
between students' and experts' judgments. Overall, the use of these
high-reliability pairs resulted in enhanced prediction of essay performance,
relative to that found in Experiment I.

In conclusion, in situations in which essay testing is impractical, the 
use of testing through relationship judgments may be a reasonable alternative 
method of assessing and promoting students' structural understanding.
Characteristics of students' relationship judgments are significantly
correlated with traditional essay measures of structural understanding;
evaluation of students' numerical relationship judgments is more objective
than is the evaluation of essay responses; a wider range of knowledge can be
assessed through relationship judgments in a limited amount of time; and the
computerized analysis of students' relationship judgments creates an enormous
time saving over the scoring of essay responses.

Table 1

Correlations between essay test scores (essay), relationship
judgment reliability coefficients (reliab), and similarity between
students' and instructor's relationship judgments (corr).

EXPERIMENT I

          Exam 1 (N = 103)         Exam 2 (N = 81)          Exam 3 (N = 88)
          Essay  Reliab  Corr      Essay  Reliab  Corr      Essay  Reliab  Corr
Essay     1.00   .16     .49***    1.00   .39***  .31**     1.00   .44***  .21*
Reliab           1.00    .36***           1.00    .30**            1.00    .22*
Corr                     1.00                     1.00                     1.00

*** p < .001
** p < .01
* p < .05

Table 2

Stepwise multiple regression summary tables: Prediction of essay scores
from relationship judgment reliability (reliab) and
similarity between students' and instructor's relationship judgments (corr).

EXPERIMENT I

Exam 1 (N = 103)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Corr      .49       1, 101   31.49***   .85
2     Reliab    .49       2, 100   15.61***   .85

Exam 2 (N = 81)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Reliab    .39       1, 79    13.77***   .98
2     Corr      .44       2, 78     9.20***   .96

Exam 3 (N = 88)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Reliab    .44       1, 85    20.56***   .84
2     Corr      .46       2, 85    11.15***   .83

*** p < .001






Table 3

Correlations between essay test scores (essay), relationship
judgment reliability coefficients (reliab), and similarity between
students' and experts' relationship judgments (corr).

EXPERIMENT II

          Exam 1 (N = 38)          Exam 2 (N = 38)          Exam 3 (N = 36)
          Essay  Reliab  Corr      Essay  Reliab  Corr      Essay  Reliab  Corr
Essay     1.00   .24     .16       1.00   .53***  .48**     1.00   .28     .76***
Reliab           1.00    .13              1.00    .37*             1.00    .43**
Corr                     1.00                     1.00                     1.00

*** p < .001
** p < .01
* p < .05

Table 4

Stepwise multiple regression summary tables: Prediction of essay scores
from relationship judgment reliability (reliab) and
similarity between students' and experts' relationship judgments (corr).

EXPERIMENT II

Exam 1 (N = 38)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Reliab    .24       1, 36     2.26      .88
2     Corr      .28       2, 35     1.44      .87

Exam 2 (N = 38)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Reliab    .53       1, 36    13.81***   .91
2     Corr      .61       2, 35    10.17**    .85

Exam 3 (N = 36)

      Variable  Multiple                      Std.
Step  Selected  R         df       F          Error
1     Corr      .76       1, 34    45.63***   .70
2     Reliab    .76       2, 33    22.32***   .70

*** p < .001
** p < .005



